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Abstract: Communication requirements of Cholesky factorization of dense and sparse 

symmetric, positive definite matrices are analyzed. The communication requirement is 
characterized by the data traffic generated on multiprocessor systems with local and shared 
memory. Lower bound proofs are given to show that when the load is uniformly distributed 
the data traffic associated with factoring an n x n dense matrix using n a , a < 2, processors 
is 0 (n 2+ °/ 2 ). For an n x n sparse matrices representing a ^Jn x \fn regular grid graph 
the data traffic is shown to be n(n. 1+a / 2 ), a < 1. 

Partitioning schemes that are variations of block assignment scheme are described and 
it is shown that the data, traffic generated by these schemes are asymptotically optimal. 
The schemes allow efficient use of up to 0(n 2 ) processors in the dense case and up to 0(n ) 
processors in the sparse case before the total data traffic reaches the maximum value of 
0(n 3 ) and 0(n 3 / 2 ), respectively. It is shown that the block based partitioning schemes 
allow a better utilization of the data accessed from shared memory and thus reduce the 
data, traffic than those based on column-wise wrap around assignment schemes. 
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1. Introduction 


Considci the problem of solving a system of linear equations 


Ax = b 


where A is an n x n symmetric, positive definite coefficient matrix, x is an n x 1 vector of 
variables, and b is an n x 1 vector of constants. Applying the Cholesky decomposition to 
A vields 

A = LL r 

where L is lower triangular with positive diagonal elements [10], From this factorization, 
the solution to the system of equations is obtained by solving the triangular systems 

Ly -b 


and 

L r x = y. 

Recently, efforts have been reported for efficiently parallelizing the various steps in comput- 
ing the solution of the dense and sparse systems. Most of this work has concentrated on 
developing algorithms that extract as much parallelism as possible on specific architectures 
[14,18,19,1,5,3,9]. The main emphasis there is on distributing the computational load as 
evenly among the processors as possible and little attention is paid towards the data traffic 
complexity. 

In this paper we are interested in the parallel Cholesky decomposition schemes with mini- 
mum data traffic for factoring dense and sparse symmetric, positive definite matrices. The 
model of computation assumed for this purpose is that of a multiprocessor system with two 
level memory hierarchy such that each processor has local memory and all processors have 
access to a common shared memory. Accessing any nonzero element in the shared memory 
is assumed to generate a unit data traffic. No data traffic is generated in accessing the 
local memory. The total number of shared memory accesses from the beginning to the end 
of the algorithm is defined as the communication requirement or total data traffic of that 
algorithm implemented on the multiprocessor system. 

In [17] the communication requirement of the Gaussian elimination algorithm implemented 
on three different architectures is analyzed. For a bus architecture where a data element 
may be broadcast to all the processors in one step and counts as one unit data traffic 
independent of the number of processors receiving the data, the data traffic complexity is 
showm to be fl(n 2 ). For a nearest neighbor ring network, w T here each transmission of a data 
element across a link of the ring counts as one unit data traffic, the data traffic complexity 
is shown to be Q(n 2 *p), where p is the number of processors on the ring. The data traffic 
complexity for a nearest neighbor grid network is showm to be n(n 2 v /p). In all the cases 
it is assumed that no element is computed in more than one processor; i.e., recomputation 
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is not permitted. By using a different proof technique than that given in [17], it is shown 
here for the assumed model of computation that the data traffic complexity of the Cholesky 
factorization scheme is y/p). The proof for the lower bound holds even if recomputation 
is allowed provided each processor is assigned at least n^/Sp amount of work. Although we 
do not prove it, our result holds for other variations of Gaussian elimination as well. 

In [4], a parallel sparse factorization scheme is given for local memory multiprocessor sys- 
tems. This scheme has a total data, traffic of 0(n l + a log 2 n) using n a processors. This result 
is improved to 0(u 1 * fnr ) in [7]. In this paper we present a factorization scheme that has a 
total data traffic of 0(n l+Of / 2 ). Our main results and the organization of the paper is as 
follows. 

In the following section, the data dependencies involved in the Cholesky factorization of a 
dense matrix are discussed and a parallel assignment scheme is presented. It is shown that 
the data traffic associated with that scheme is 0(n 2 * y/p) when an n x n dense matrix is 
factored using p processors. By giving a proof on the lower bound for the data traffic, in 
Section 2.4 it is shown that under the condition of uniform load distribution the computa tion 
time and data traffic complexities of the assignment scheme are asymptotically optimal. In 
Section 3., the case of factoring sparse, symmetric, positive definite matrices is considered. 
The sparse matrices considered here are restricted to only those matrices that represent the 
graphs arising in finite difference and finite element applications. In Section 3.5, a block 
based parallel factoring scheme for sparse matrices is presented. The data traffic in factoring 
an 7 i x n sparse matrix corresponding to a 2-dimensional regular grid graph is shown to be 
0(n • y/p ). In Section 3.6 a lower bound on the data traffic in factoring the sparse matrix 
is shown to be Q(n • y/p ). These results can be extended to other graphs that satisfy an 
/(n)-separator theorem [13]. Preliminary versions of the results given here appear in [15] 
and [16]. 

For the sake of clarity, in the following discussion the dense matrix is assumed to be of size 
m x m and the sparse matrix of size n x n. 


2. Parallel factoring of dense symmetric, positive definite 
matrices 


The basic algebraic scheme considered here for factoring an m x m symmetric, positive 
definite matrix A is the column version of the Cholesky decomposition method [10]. An 
outline of this algorithm is given next and the data dependencies are discussed. Following 
that a partitioning scheme with optimal data, traffic is presented. 

In the following discussion, values in row i refer to the values of the elements on and to the 
left of the diagonal. Similarly, o,,* (a.j) represents all the elements in row i (in column j) 
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of the lower triangular part of the matrix under consideration. 


2.1 The Cholesky factorization 

Let A = LL r \ (i{j € A and / t j E L. 

for j = 1 until m do 
begin 

Initialize Uj = i = ;,•••, m 

for Jfc = 1 until j — 1 do 
for i = j until m do 

= ~ * (j\* » 

hj = A7 j 

for jb = j + 1 until m do 
end 


In the above algorithm for clarity, the values of hj are shown separately from those of a, } j. 
In practice /,j may overwrite a>ij. 

Clearly, in the Cholesky factorization scheme outlined above, computing the elements in 
a column j of L requires values of the elements in columns 1 through j — 1 of L and the 
values of the off-diagonal elements in j are used for computations of elements in columns 
jf + 1 through m of L. Specifically, computing an element Uj in L requires all the values 
from the set, 

A ij = I 1 <j'< j} u | 1 < j' < j } u 

Moreover, the steps of the innermost loop, where a product of two elements of L is sub- 
tracted from o,-j, may be performed in any order. Once /,,j is computed, it is used in the 
computation of every element in the set, 

A;,j = {U,j' I j < j < *"} U | * "^ J ^ m }- 

Again the element 1,-j may be used in any order in the computations of the elements in the 
set A i t j. 
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2.2 A partitioning scheme for Cholesky factorization 


Without loss of generality, assume that p — (r 2 + r)/ 2 where r is an integer. The lower 
triangular part of the matrix A is divided into p partitions by taking r vertical and r 
horizontal sections each of size 8 , where u = m/r. All except r of the resulting p partitions 
are square blocks of size s x 8. The remaining r partitions which lie on the diagonal of the 
matrix are s x s triangular blocks. Each of the partitions is assigned to a single processor. 
Initially, each processor reads the data for its partition from shared memory into its local 
memory. The computations proceed in parallel according to the column version Cholesky 
algorithm as follows. The r processors in charge of the partitions containing the left most 
8X8 blocks of the matrix commence the computations of their part of the factorization. 
As soon as an element of the factor is computed, it is written into shared memory for access 
by other processors. As the necessary data becomes available, the remaining processors 
initiate computations on the blocks assigned to them. This is continued until the entire 
factor is computed and written into the shared memory. This partitioning and factorization 
scheme for dense symmetric, positive definite matrices is referred to as the block oriented 
column Cholesky-i actorization scheme or simply as the BLOCC scheme. 



Figure 1: The data traffic associated with block I 

2.3 Data traffic complexity of the BLOCC scheme 

First consider the data traffic associated with computing the elements in a generic square 
block I in the factor L shown in Figure 1. In that figure, the darkened area represents the 
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data elements that are required for the computations in block I. Let block I be bounded by 
the columns (i — l)s 4- 1 and i -a, and by the rows (j — l)s + 1 and j • s where 1 < i < r and 
1 < j < r. The following lemma provides bounds for the communication cost of block I. 


Lemma 1 A total data traffic of (2i -1/2)* s 2 + s/2 is necessary and sufficient for computing 
the elements in the square block I; it is the same for all square blocks bounded by the columns 
(r — l)s + 1 and i s. The data traffic associated with a 8 X s triangular diagonal block 
bounded by the columns ( i — 1 )s -f 1 and i • s } is ( i — 1/2) • s 2 + s/2. 


Proof : See [15]. 


I 


Using these results, a bound on the total data traffic is obtained, as shown next. 


Theorem 1 The total data traffic associated with the BLOCC scheme for factoring an 
m x m dense symmetric , positive definite matrix using p processors is 0(m 2 y/p). 


Proof : The total data traffic associated with all the blocks bounded by columns (i — l)s+ 1 
and i - 5, 1 < i < r, is given by, 

(r — i) X (data traffic associated with a square block) 

+ (data traffic associated with a triangular block bounded by the given columns). 

From Lemma 1 and the fact that there are r such column partitions, we get the total data 
traffic involved in factoring the m x m dense matrix using the BLOCC scheme as: 

£ ((r - »)((2« - 1/2) • * 2 + s/2) + ((« - 1/2) • * 2 + s/2)). 

i=l 

Ignoring the lower order terms, the total data traffic is 

= £(2r.»- 2 i 2 ) • a 2 

i=i 

< r 3 • « 2 /3. 

Since p = (r 2 + r)/2 and s = m/r, the total data traffic is 0(m 2 y/p). I 


2.4 A lower bound on the data traffic complexity 


Theorem 1 gives an upper bound on the data traffic associated with factoring the m x m 
matrix using the BLOCC assignment scheme. In the next theorem a lower bound on the 
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data traffic in factoring the dense symmetric, positive definite matrix is established. Before 
giving the proof, we again consider the data dependencies involved in computing an element 
of the factor. Consider the computations at any element a;j as shown in Figure 2. To 
compute the corresponding element /;j, values at all the elements in row j and the values 
of elements in column 1 through j of row i are needed. Thus, if a;j is an off diagonal 
element then 2 j values are needed for the computations and if it is a diagonal element (i.e., 
« = j) then j values are required. There are three observations to make regarding these 
computations as follows: 


i. The values at all the elements in any row i are needed to complete the computations 
corresponding to the diagonal element no other values are needed for computing 
Moreover, no other element in the factor can be computed by knowing only the values in 
row i. 

ii. If i and j are any two rows such that i > j, then the values of the elements in these two 
rows are used to complete computations at exactly one off-diagonal element a,j. Values 
from no other row are needed to complete the computations at that element. 

iii. For the computations at a subset of elements spread over k T rows and k c columns, values 
from at least max(lfc r ,Jfc c ) rows are needed. 



Figure 2: Data dependencies for the computations at a, j and at ele- 

ments in subset 5 

The last observation follows from the fact that the computations require values from k r 
rows as well as from the k c rows that correspond to the Jfe c columns. However, some of the 
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k T rows and the k c rows may overlap. In addition, note that if j* is the leftmost of the k c 
columns then at least all the values in columns 1 through j 1 on max(i r , k c ) rows are required 
in the computation of the elements in the subset under consideration (see Figure 2). 

The above observations and the result established in the following lemma are used to get a 
lower bound on the data traffic in computing the Cholesky factor. 


Lemma 2 Let IT be the amount of computational work which is to be distributed uniformly 
among p processors and let a be any constant less than one. For any subset of this compu- 
tation consisting of W/2 amount of work , there are at least (1 - a) • p/(2 - a) processors 
each assigned a ■ W /2p or more work from that subset. 


Proof : The work is uniformly distributed among p processors and hence each processor 
is assigned W/p amount of work. Now let 5 be a subset consisting of IT/2 amount of 
computational work. All p processors may be assigned some portion of work from S. Let 
Wi be the computational work from 5 assigned to processor p,-, where 0 < w{ < W/p. 
Therefore, 

A W 

1=1 1 

and W/2p is the average work from 5 performed by each processor. Thus, there is at least 
one processor that is assigned W/2p or more work from 5. Let a be a constant less than 
one and suppose that x processors are assigned at least a • IT /2p amount of work from 5. 
Each of the x processors may be assigned at most W/p work from 5. Now there are p — x 
processors that compute less than a ■ W j2p amount of work from 5. Therefore, 

IT a -IT 
— * x + ~ — 

V 

Solving the inequality for x we get, 


, x IT 
(p-*)> y- 


X > 


1 — a 

2 - a 


Thus, there are at least (1 - a) • p /( 2 - o) processors each computing a • W /2p or more 
amount of work from 5. I 


In the following theorem a bound on the data traffic associated with computing the Cholesky 
factor of an m x m matrix is established. Note that the result holds under stronger conditions 
than required by the model of computation assumed here. 


Theorem 2 Let A be a dense, m x m symmetric positive definite matrix that resides in the 
common memory. If the computational work is uniformly distributed among p processors, 
then the data traffic involved in computing the Cholesky factor of A is For p > 6, 

the data traffic is fi(m 2 ■ yfp) even if the initial values of matrix A are in the processor local 
memory before the computation begins. 
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Figure 3: Data traffic associated with factoring of elements in region 

FGH 

Proof : Suppose that matrix A is initially stored in the common memory. The initial value 
of each element is fetched at least once by the processors for computing the factor. Thus, 
at least m 2 / 2 amount of data traffic is associated with the Cholesky factorization. Hence 
when the number of processors is a constant, there is nothing to prove. 

In the following it is proved that even if the matrix A is distributed initially among the 
processors according to their work assignment, the data traffic is n(m 2 • p) when the 
number of processors is greater than 16(a + 4/a - 4)/3 for any constant a less than one. 
Since there exists an a less than one such that 16(a + 4/a — 4)/3 is less than six, the proven 
result holds for all p > 6. Consider the computations corresponding to the elements in the 
set 5 = {a tJ - | i,j > m/2}. In Figure 3 the region FGH denotes this set of elements. The 
total computational work in factoring the m x m matrix is m 3 / 6 + m 2 / 2 + m/3 and that 
corresponding to the elements in set 5 is m 3 / 12 + m 2 / 4 + m/6. Thus, the amount of work 
associated with 5 is exactly half of the total work. If a is a constant less than one, then 
from Lemma 2, there are at least (1 - a) • j>/( 2 - a) processors each computing at least 
a • m 3 /12p amount of work from the region FGH. 

Let II be the set of processors each with at least a-m 3 /12 p amount of work from the 
region FGH and let p, € II. Now the computational work associated with any element in 
FGH is at most m (work corresponding to element a miTn ). Therefore at least a-m 2 /12p 
elements in the region FGH are assigned to processor p,. Let x be the number of rows 
on which the elements assigned to processor p,- lie. This implies that there are at least 
[(a • m 2 )/( 12 p • z)] columns on which the elements assigned to p, lie. Therefore, from the 
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Observation (iii) above, data from at least max(r, [(a • m 2 )/( 12 -p • z)"|) number of rows in 
the region DFGE are required for completing the computations from region FGH assigned 
to processor p,. Without loss of generality, assume that the quantity 12 p ■ z divides a • m 2 
evenly. Now, the quantity ma.x(r,(o' ■ m 2 )/( 12 -p • x)) is minimum when x = y/a • m j >/l2p. 
Thus, the computations in processor p,- require at least all the values of the elements on 
the y/a • m/ v /l2p rows in region DFGE. In region DFGE each row has m/2 elements and 
thus the values of at least y/a • m 2 /( 4 • \/3p) elements from the region DFGE are needed 
in processor p, for performing computations in the region FGH. 

Now processor pj may also be assigned some work from the region DFGE in addition to that 
in FGH . Hence to complete the proof, it is necessary to show that of the y/a ■ m 2 /( 4 • y/2p) 
elements needed by processor pi at least c • m 2 / y/p elements are not available locally, where 
c is a constant less than one. In that case the data traffic associated with processor p, is 
at least c • m 2 / y/p and since there are at least (1 - a) *p/( 2 - a) such processors, the total 
data traffic in computing the Cholesky factor of an m x m dense matrix is Q(m 2 * v / p ). We 
complete the proof by showing in the following that p, accesses at least c • m 2 / y/p non-local 
elements from region DFGE for completing the computations in the region FGH. 

Processor p, is assigned at least a • m 3 /12p amount of work from the region FGH . Since 
each processor is assigned m 3 /6p amount of work (the uniform load distribution condition), 
Pi performs at most (2 - a) * m 3 /12p amount of work in the region DFGE. The data traffic 
associated with processor p; in completing the work in the region FGH is a minimum 
when all the elements from region DFGE assigned to p, lie on the y/a • mjyj\2p rows. 
Furthermore, to reduce the data traffic, as many elements on these rows as possible should 
be assigned to processor p t . Now the computational work corresponding to any element 
dij is ;; that is the work associated with an element on the leftmost column of the matrix 
is the smallest and it increases for elements on any row from left to right. Therefore the 
data traffic associated with pi is a minimum when it is also assigned the computational 
work corresponding to the elements in the leftmost columns on the chosen rows of region 
DFGE. Let k be the number of the leftmost columns on which the the elements from region 
DFGE that are assigned to p, lie. The shaded region shown in Figure 3 corresponds to the 
elements which minimize the data traffic for processor p,*. Since processor p, performs at 
most (2 - a) • m 3 /12p amount of work in DFGE, the condition on k is given by, 

y/a * m . (2 - a) • m 3 

" 12 p 

2 Y y/a y/Zp 2 

It can be verified that there is a constant /? greater than one, such that if p is greater 
than 16(a — 4 + 4/a)/3, then the right hand side of the above inequality is at most 
m/2/? for all values of m. This gives a bound on k. Therefore work corresponding to 
at most ( y/a • m 2 )/( 4/? ■ -/ip) elements in the region DFGE may be assigned to processor 
Pi which will minimize its data traffic for the computation in the region FGH. Hence of 
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the (yfa • m 2 )/(2 ■ yj 1 2 p) elements needed by processor for completing the computation 
in the region FGH , at least (1 — 1 / /?) * • m 2 /(4y3p) elements are not available locally. 

Thus, if the number of processors, p, is greater than 16(a - 4 + 4/a)/3 for any a less than 
one, then the data traffic associated with processor pi is at least c • m 7 /y/p for some con- 
stant c less than one. Since there are at least (1 - a) *p/( 2 - a) such processors, the result 
follows. 1 


2.5 Remarks on the BLOCC Scheme 


Assuming that each step of the innermost loop in the Cholesky decomposition costs one 
computational time unit and ignoring the costs associated with other steps, the sequential 
computation time for factoring the m xm matrix A is m 3 /6 4- 0(m 2 ). The BLOCC scheme 
described above has a computation time of m 3 /2 p + 0(m 2 /p), where p is the number of 
processors used. As shown in Theorem 1 the associated data traffic is less than y/2-m 7 -y/p/3. 
Thus, the time and the data traffic complexities of the BLOCC scheme are optimum in an 
order of magnitude sense. However, the computational load in the BLOCC assignment 
scheme is not perfectly balanced. The processors that compute elements in the partitions 
that are towards the left side of the matrix L finish computation earlier than those that are 
on the right. This balance may be improved in several different ways, but at the cost of 
increasing the data traffic. In one such scheme the columns of the matrix are assigned to 
each processor in a wrap around fashion; that is, columns i,p + i, . . . , m - p + i are assigned 
to processor i. All the elements on any column of L are computed by a single processor. 
Let this assignment scheme be referred to as the wrap around assignment scheme. In this 
scheme the computation is distributed more evenly among the processors than that in the 
BLOCC scheme. The computation time is reduced to m 3 /6p + 0(m 2 ), provided m is at 
least p(p + 3)/2. However the data traffic associated with this scheme is m 2 • p/2, which 
is suboptimal. In [11] and [2] the wrap around assignment scheme is recommended as 
a preferred method for computing the factor on a multiprocessor system because of its 
good load balancing properties. Their analysis does not take into account the cost of the 
associated data traffic, which must be taken into account for reducing the overall execution 
time. 


3. Parallel factoring of sparse, symmetric positive definite 
matrices 


In this section a partitioning and assignment scheme is presented that computes the factors 
of an n x n matrix, associated with a 2-d regular grid graph, using a < 1, processors 
with a total data traffic of 0(n ] + a ! 7 ). Then it is shown that the data traffic in factoring the 
matrix is ft(n 1+ °/ 2 ) when the load is distributed uniformly among n a processors, a < 1. 
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It is also shown that in any scheme that requires n a processors, a > 1, the data traffic is 

0(rc 3 / 2 ). 


For factoring a sparse matrix efficiently proper ordering of the matrix is essential. Ordering 
of the matrix to be factored also determines the data dependencies and hence the data traffic 
associated with any partitioning and assignment scheme. For matrices associated with 
regular grid graphs, nested dissection is a well known ordering scheme [6]. In the following 
a few basics of this ordering scheme are briefly described and some notation is introduced 
that is necessary for the analysis presented later. In the following it is assumed that the 
reader is familiar with the elementary concepts underlying the nested dissection algorithms, 
and the terms such as elimination order and the fill associated with the elimination process. 
It is also assumed tha t the reader is familiar with the basic graph theory concepts related to 
matrix representations of systems of equations, in particular, the notion of vertices, edges, 
separators, subgraphs of a graph, and the correspondence between the vertices and the 
rows and columns of the matrix, between the edges and and the non-zero elements, and the 
added edges and the fill-in during the factorization of the matrix. For details see [12] and 
[6] and the references therein. 


3.1 Nested dissection method as applied to 2-d grid graphs 


A nested dissection method may be viewed as a divide-and-conquer algorithm on an undi- 
rected graph. It relies on finding a small set of vertices, called the separator set, in the graph 
such that the removal of these vertices divides the graph approximately in half. Informally, 
the nested dissection method orders the vertices of the graphs as follows. The vertices in 
the separator set are ordered last. Then the vertices in the subgraphs obtained from the 
original graph by removing the separator are ordered recursively. In [12] a nested dissection 
algorithm is given for ordering the vertices of any graph G such that G and all subgraphs 
of G satisfy a ^/ra-separator theorem. The ordering produced by this algorithm guarantees 
a O(nlogTi) fill and 0(n 3 / 2 ) sequential operation count for a system corresponding to an 
rc-vertex graph G. In [8] a nested dissection algorithm is given for ordering the vertices of 
a graph G that has a ^/n-separator decomposition.* For a detailed treatment of the nested 
dissection methods and for the relevant practical applications see [6]. 

For the sake of simplicity and clarity, here only the systems corresponding to y/n x y/n 
regular grid graphs are analyzed. However, the techniques developed for analyzing data 
traffic complexities are applicable to other systems where the nested dissection method can 
be used to give a “good” ordering. In the following, the nested dissection method used 
for ordering the vertices in a y/n X y/n regular grid graph is briefly described. In the 
discussion, the grid graph is sometimes simply referred to as the grid and a subgraph of the 
grid graphs is referred to as a subgrid. For the rest of the discussion, it is also assumed that 

*A graph G is said in have a y/n -separator decomposition for constants a < 1 and > 0 if G has a 
v/n-separator C and every connected component of G~- C has a \/u-separator decomposition. 
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Figure 4: A 7 x 7 grid with nested dissection ordering 

the vertices of the grid are connected according to a 9-point stencil, unless otherwise stated. 
Let V be the set of vertices of a y/n x y/n regular grid graph. Without loss of generality, 
assume that y/n = 2* — 1 for some integer /. Let So be the set of 2 l — 1 vertices on a vertical 
mesh line, the removal of which partitions V into two subgrids, V\ and such that the 
vertices of both the subgrids are arranged in a (2 l - 1) x (2*~ 1 — 1) mesh. The vertices of 
So are numbered from n — y/n + 1 to n in any order. Suppose that \\ is the left subgrid 
and V 2 is the right subgrid. Let Si be the set of vertices on a horizontal mesh line that 
divides Vi into two equal parts each containing (2 ,_l - l) 2 vertices that are arranged along 
a (2 /_1 - 1) x (2 /_1 — 1) square mesh. Similarly, let 52 be the set of vertices from l 2 which, 
when removed, produce two equal halves from V 2 . Both Sy and 52 contain 2 ,_1 — 1 vertices. 
Let the vertices in 5 1 be numbered from n — 2^71 + 2 to n — 3y/n/2 + 1/2 and those in 52 
be numbered from n — Zy/n/2 + 3/2 to n — y/n. Thus, the removal from V of the vertices 
in the set 5(^5! [j 5 2 partitions V into four (y/n - l)/2 x ( y/n — l)/2 subgrids. The 
separator set 5 q U-^i U ^2 is referred to as the “+” -separator for the grid corresponding to 
V . The middle vertical part of the “+”-separator is referred to as the vertical sub- separator 
and each of the two horizontal halves of the “+”-separator is referred to as the horizontal 
sub-separator. All the vertices of the four subgrids are numbered by recursively identifying 
and ordering the vertices on the “+”-separators of the subgrids induced by the vertices 
ordered so far. The recursion stops when a subgrid has only one vertex on it. For any 
w +”-separator, there is a vertical sub-separator and two horizontal sub-separators. With 
the above described ordering scheme, for any given “+”-separator, the vertices on the two 
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horizontal sub-separators are given numbers that are smaller than those assigned to the 
vertices on the corresponding vertical sub-separator. Thus, we say that the vertices on 
a horizontal sub-separator are ordered ahead of the vertices on the corresponding vertical 
sub-separator or that the vertices on the vertical sub-separator are ordered after those on 
the horizontal sub-separators. An example of ordering the vertices in a 7 x 7 grid is shown 
in Figure 4. Observe that the grid is recursively partitioned into four subgrids by a set of 
vertices that form a ’’-separator. 

To label the subgrids and the separators of the grid graph, we use the notation given in 
[7]. Each subgraph and the separator that induces the subgraph are given a level number 
depending on the recursion level of the nested dissection on which the subgraph is ordered. 
Under this scheme the original grid is called a level - 0 (sub)grid. The four subgrids of size 
(y/n — l)/2 x (y/n - l)/2 are the level-1 subgrids. The “-(-’’-separator that partitions the 
level-0 grid into the four level- 1 subgrids is called the level- 1 “-(-’’-separator or simply as the 
level-1 separator. Thus, if n is equal to ( 2 l — l) 2 , there are / levels of subgrids numbered 0 
through / — 1 and / - 1 levels of separators, numbered 1 through / — 1. 

In the following it is assumed that the matrix to be factored is ordered using the nested 
dissection scheme and that the symbolic factorization step is already completed. 


3.2 Cholesky factorization scheme revisited 


Consider the Cholesky factorization scheme described in Section 2.1 for factoring a sparse 
symmetric, positive definite matrix. 


Clearly, the main difference between factoring a sparse and a dense matrix using the 
Cholesky factorization scheme is that in the former case there is no need to modify col- 
umn j by all columns to the left of it. Specifically, column j is modified only by columns 
k for which lj ^ ^ 0. Moreover, if column k modifies column j, only the nonzero elements 
of column k need to be fetched. Exactly which elements are needed is formalized later. In 
Figure 5(a), the zero-nonzero structure of L ) corresponding to the vertices of the separators 
on the first two levels, is shown schematically. The shaded areas represent the nonzeros. 
The corresponding grid is shown in Figure 5(b). It is clear from the figure that only certain 
values from certain columns are needed for computing an element of the factor. 

Another important difference is that, because of the ordering applied, several columns may 
be computed simultaneously. As stated earlier, column * and row j of the matrix corresponds 
to a vertex V{ in the elimination graph and the factoring of the matrix corresponds to the 
elimination of the vertices. Thus, all the vertices on the level /-I subgrids may be eliminated 
simultaneously followed by those on the level / — 2 and so on. This observation is useful in 
extracting parallelism in the factorization step. 


13 


\ 



r a 

r 4 

r* 

r i 

r i2 

r « 

r. 



r. 

r in 

r aj _ 




£ 


h 


( * ) 


( b ) 


Figure 5: Structure of L 

3.3 The worst case data traffic complexity 


In this section a bound on the worst case data traffic complexity for factoring the matrix A 
is established. Clearly, the communication requirement is the worst when the use of local 
memory is not allowed. Thus, an upper bound on the worst case data traffic is obtained 
by assuming that the values of all the elements of the lower triangular part of matrix 
A and those of L, as well as any intermediate results, are stored in the shared memory. 
Suppose also that any number of processors are allowed to participate in computing a 
nonzero element of the factor provided that no computation is repeated. Consider the 
computations associated with a nonzero element € L. Recall that in computing /,j, first 
a ti j — ]l1=i h,k'lj,k is evaluated and then the resulting value is divided by Ijj . Thus, for each 
multiplication, there is one subtraction operation, at most one division and three memory 
references and a constant overhead such as index computation. Therefore, in the worst case, 
each multiplication operation in the Cholesky factorization is associated with a constant 
amount of data traffic. The following theorem gives a bound on the worst case total data 
traffic. In the proof of the theorem, the result given in Theorem 8.1.8 of [6] is assumed. 
That theorem states that the number of operations required to factor a matrix associated 
with an n-vertcx 2-D grid ordered by nested dissection is given by 829n 3 / 2 /84 + 0(n -log n). 
Although the following result is obvious, it is useful because it is independent of the number 
of processors used and it gives the worst case bound on the data traffic even for the models 
of computation that are more restrictive. 
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Theorem 3 The worst case data traffic associated with factoring the matrix A is 0(n 3 / 2 ). 


Proof: Associated with each multiplication operation in the factorization there are at most 
a constant number of memory references. Suppose that k memory references are involved 
per multiplication. Thus, the total data traffic is 

< k • number of multiplication operations. 

Now, the number of multiplication opera tions associated with factoring matrix A is 0(n 3 / 2 ) 
[6]. Hence, the worst case total data traffic is 0(n 3 l 2 ). I 

Note that the above theorem is applicable to all the graphs for which a v/n-separator 
theorem holds. 


3.4 Data dependencies for the sparse Cholesky factorization 


The worst case bound on the data traffic established in Theorem 3 can be improved for the 
model of architecture assumed in the case of the dense matrices. In that model, no element 
is fetched more than once from the shared memory and hence the values of the elements used 
in more than one operation are stored in the local memory associated with the processor. 
To maximize the potential of such a model, it is necessary to clearly understand the data 
dependencies involved. The vertices of the grid are ordered using the recursive nested 
dissection scheme. Hence it is sufficient to investigate the data dependencies involved in 
computing the elements of L in the columns corresponding to the vertices in a generic 
“-f ’’-separator. This is accomplished in the next two lemmas. 

Let rj? = {Jt|)b < j and / t) * ^ 0,/ lt * € L}\ i.e., i)\ is the set of all columns of the factor L to 
the left of the column jF + 1 such that the elements in row i of these columns are nonzero. 
Let fjj k = U?=i i vi'j i- e *> vlk set fc he columns to left of column ; + 1 such that on 

each of these columns there is a nonzero element in at least one of the i through k rows of 
the factor. Let T represent any m-vertex sub-separator. It is assumed that all the vertices 
in any sub-separator are ordered consecutively. Let /ow;(T) and high(T) be the indices of 
the lowest and the highest ordered vertices, respectively, on the sub-separator T. Note that 
high(T) - low(T) + 1 = m. In Figure 6, a sub-separator T is shown. This sub-separator 
separates the vertices in regions R\ and /? 2 * The diagonal and off-diagonal non-zero blocks 
associated with this sub-separator are shown in Figure 7. 

The following lemma establishes some basic sub-separator related properties that are useful 
in analyzing the communication requirements. 


Lemma 3 Let T be any m -vertex sub- separator, (i) Corresponding to the vertices ofT there 
is a dense m x m triangular diagonal block in the Cholesky factor, (ii) 7n the factor L, the 
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Figure 6: Sub-separator T with four surrounding sub-separators 

columns low(T) through high(T) contain at most four off-diagonal rectangular blocks with 
nonzero elements. Each of these blocks is of size at most (ci ■ m 4- C 2 ) x m where ci < 2 
and c? < 3 are positive integer constants. Any nonzero element in these columns is either 
in one of these four blocks or in the diagonal triangular block. 


Proof: The first part of the lemma is obvious. In Figure 6, the sub-separator T separates 
the vertices in regions Ri and Ri- Since the vertices in these two regions are ordered ahead 
of those of T, the fill due to the elimination of vertices in regions Ri and Ri ensures a dense 
m x m triangular diagonal block bounded by columns low(T) and high(T) as shown in 
Figure 7. 

To prove the second part of the lemma, again consider Figure 6. In that figure, the thickness 
of the lines qualitatively indicates the separator levels in the nested dissection ordering. Let 
Ti, T 2 , T 3 , and r 4 be the four partial sub-separators that surround the sub-separator T. 
Because of the nature of the nested dissection ordering, the vertices of T are “connected” to 
only those higher ordered vertices that lie on T 1 , r 2 , Tj, and r 4 and to no other vertices. ^ 
Thus, all the nonzeros on columns /ow(r) through high(T) in rows below high{T) are 
confined to only the rows correspond ing to the vertices on Tj, r 2 , Tj, and T 4 . Furthermore, 

* Vertex « is said to be “connected” to vertex t> if there exists a path ■},...,«*,»] of length one 

or more in the grid graph such that inrfcx(« r ) < minfini/ex(«), index(v)), for 1 < r < t; in such a case, 
li,i £ L i» » non-zero, where « = in«fex(«),ji = tndex(v). 
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each vertex in T is “connected” to every vertex on these four partial sub-separators and 
hence the four rectangular blocks are dense. This is shown schematically in Figure 7. It 
can be verified that if T is a horizontal m-vertex sub-separator, then the surrounding box of 
vertices is of dimension (2m + 3) X (m + 2). Therefore there are two rectangular off-diagonal 
dense blocks of dimension at most (2m + 3) x m and the other two of dimension at most 
(m + 2) x m. Similarly, if T is a vertical m-vertex sub-separator, there are four off-diagonal 
rectangular blocks of dimension at most (m + 2) x m in the factor. If T is not surrounded 
on all four sides then some of these blocks will be missing. I 



Figure 7: Off-diagonal blocks with nonzeros corresponding to 

sub-separator T 


From the above lemma it is clear that, in computing the nonzero elements in the columns 
corresponding to the vertices on a sub-separator, only the data dependencies of the elements 
in the four rectangular blocks and the diagonal triangular block need be considered. This 
is accomplished in the following lemma where a bound is derived on the amount of data 
required in computing the nonzero elements lying on a given tow and on one of the five 
blocks. The lemma shows that the number of nonzero elements in any row t of the factor 
L is less than c • m where c is an integer constant and m is the size of the sub-separator 
to which the vertex corresponding to row i belongs. It is then shown that, for any row i, 
the computations at all the elements € L, low(T) < j < high(T ), for some m-vertex 
sub-separator T, require a total of less than c ■ m nonzero elements from that row. Note 
that this count is independent of the sub-separator to which the vertex corresponding to 
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row i belongs. Thus, the computations at all the elements in a row of any of the five blocks 
specified in Lemma 3 require only c*m elements from that row, irrespective of the relative 
location of the off-diagonal blocks in the factor. 


Lemma 4 Let T be any m -vertex sub-separator. The nonzero elements from row i, i > 
loxv(T), required in completing the computations of all elements /,j £ L such that low(T) < 

j < high(T) } are those elements in row i on the columns in the set given by, V^wl^highir) H ty l0h ^- 

For all i greater than or equal to low(T), H T ^ xgh ^^ || is at most c-m for some 

constant c. 


Proof : Any nonzero element /,j E L, i > low(T) and /ou;(r) < j < high(T), is in one of the 
five blocks specified in Lemma 3. Hence, to prove the result of this lemma, only the rows 
that intersect one of these blocks need to be considered. The result for low(T) < i < high(T) 
is proved first followed by that for i > high(T). 


hi9h{V) since n hi9k{r) C n high{r) 

, since, c. V low ( r ^ hi \ g h(ry 


When low{T) <i < high{T), tfZ\r)!high(r) H V-' 9h(r) = V-' 

By definition, the set r^ xgh ^ T ^ contains all the columns that have a nonzero element in row 
i . Clearly, the nonzero elements from the row i required in completing the computations at 
all the elements Uj G L y low(T) < j < high(T), are on columns in the set rj^ gh ^ T \ 


To measure the size of the set rj^ tgh ^ r \ note that it is bounded by the number of vertices 
ordered ahead of the vertex * and which are “connected” to vertex t. Using the recursive 
nature of the nested dissection ordering it can be verified that in the case of constant degree 
grid graphs and when low(T) < i < high(T), the size of the set r^ xgh ^ i s bounded by c*m, 
where c is a constant dependent both on the degree of the graph and on whether T is a 
horizontal or vertical sub-separator. If T is a horizontal m-vertex sub-separator then, for 
a 5-point stencil, c is equal to 7 and, for a 9-point stencil, c is equal to 11. When T is a 
vertical m-vertex sub-separator, the values of c are 5 and 7, respectively. This completes 
the proof when low(T) < i < high(T). 


The case where i > high(T ) is considered next. As shown above, ||^ l ^ /l ( r )|| depends on the 
size of the sub-separator to which the vertex i belongs and hence, when i > high(T) } 
Wrii'gh^W can be much greater than O(m) where m is size of T. However, when the 
computation of only those elements in row i that lie on columns /ow(r) through high(T) are 
of concern, each of these computations consists of a product of a nonzero element in row i 
and a nonzero element in one of the rows low(T) through high(T) in the column high(T) or 
in some other column to the left of it. Thus, for these computations, only the columns that 
have a nonzero element in row i and in row j, where /flw(r) < j < high(T ), are of interest. 
The set V^l{i T )\i g h( r) cons i sts °f columns that have a nonzero element in at least one 

of the rows low(T) through high(T). Similarly, ^*^( r ) consists of all the columns that 
have a nonzero element in row i. Clearly, the set ^ rjj^ xgh ^ consists of all the 
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columns which contain all the pairs of nonzero elements that must be used in completing the 
computations at all the elements /, j, loiv(T) < j < high(T). Thus, the nonzero elements 
from row i, i > high(T), required in completing computations at all the elements /,j € L 
such that low(T) < j < high( T), are those elements in the row i on the columns in the set 

. , -hiqh(T) ^ high(T) 

given by D Vi 


To get a bound on the size of V^vlcr) iipA(r) H Vi ' 9>l r > consider the m- vertex horizontal 
sub-separator T shown in Figure 6. It is surrounded by sub-separators Ti, T2, T3, and T^ 
Suppose that /ow(Fi) < i < high(Ti). The set V^ow(f^,htgh(r) fl consists of columns 

corresponding to vertices on T or corresponding to those vertices ordered ahead of them 
which are “connected” to at least one vertex in T and to the vertex corresponding to row i. 
Using the recursive ordering of the nested dissection scheme it can be shown that the number 
of such vertices is less than 7m. Thus, ||?^(r^j»i0fc(r) ^ < 7 m, for low(Ti) < i < 

high(Ti). The same bound is obtained when low(Ti) < i < high(Ti). If /ow(r2) < i < 
high(T 2 ) or low(T 3 ) < i < high(T 3 ) then it can be verified that, H T7 /l * ff/l(r) || < 

3m. If T is vertical sub-separator the two bounds are 5m and 5m/2 respectively. I 


3.5 A partitioning scheme with minimum data traffic 


In this section a partitioning scheme for computing the factor of the sparse matrix A is 
described. Suppose that an n x n matrix is to be factored using n a processors, a < 1. 
The vertices of the y/n X y/n grid graph corresponding to this matrix are ordered using 
the nested dissection method described earlier. Assuming n = (2 1 — l) 2 , the ordering results 
in l levels of subgraphs and / — 1 levels of “-(-’’-separators. If the original y/n x y/n grid 
is considered to be on level 0, then on level i there are 2 2 ‘ level-i subgraphs each of size 
(2 ,_ * — 1) x (2 ,_I — 1). Without loss of generality, assume that a •/ is an integer. Thus, in 
the partitioning scheme described here, all the vertices on a level-a/ subgraph are assigned 
to the same processor. In that scheme, initially each processor independently computes the 
elements in the factor corresponding to a (2( 1-ra ) , — 1) x (2( 1-a ) ' - 1) subgraph which are 
separated from one another by the level-a/ separators. Once the elements in the columns 
corresponding to the vertices on the level-(/ - 1) through level-a/ separators are computed 
locally, a processor Pi combines with three other processors to compute the elements on the 
columns of L corresponding to the vertices on the level-(a/ — 1) “-(-’’-separator. The two 
horizontal sub- separators are computed by two processors and the vertical sub-separator of 
that level is computed by all four processors. The next lower level “+” -separator is computed 
in parallel by sixteen processors from the four neighboring groups. This is continued until all 
the vertices are elimina ted. On each level of computation each group of processors computes 
the elements of the factor independent of the other groups. The elimination of the vertices 
on the vertical sub-separator of level-1 is computed in parallel by all processors. This 
corresponds to factoring a \/n x y/n dense matrix. The computations corresponding to the 
level-i separator, i < a l, are performed as follows. The computations corresponding to the 
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vertices on level-(o/ — k ) “+ ’’-separator, 1 <k<al, arc completed by p = 2 2 k processors 
working in parallel. Using all the available processors, the factorization corresponding to 
the m x m triangular diagonal block is first completed. Then the processors are used to 
compute the elements corresponding to the four off-diagonal blocks. For the first part, the 
BLOCC factorization scheme described for the dense matrices is used. The m x m dense 
diagonal block is partitioned into r 2 / 2 — r/2 square blocks and r diagonal triangular blocks 
each of size m/r x m/r where p = r 2 /2-f r/2, and each of these p partitions is assigned to a 
unique processor. Each processor completes the computations corresponding to its partition 
by accessing the required data from the shared memory. For the purpose of factoring, the 
off-diagonal blocks are treated as if they were adjacent, and the resultant rectangular block 
is partitioned into p sub-blocks each of size c • m/y/p x m/y/p, where c < 6 for a horizontal 
sub-separator and c < 4 for a vertical sub-separator. Again each partition is assigned to a 
separate processor. This process is repeated on the next lower level “+”-separator. Thus, 
in the assignment scheme described here, each processor is assigned a new subblock on each 
level and the size of the subblock assigned to a processor varies from one level to the next. 
Let this partitioning scheme be referred to as the sparse block oriented column Cholesky 
factorization scheme or simply as the sparse BLOCC scheme. Note that the underlying 
numeric algorithm is the column oriented Cholesky factorization. 


Data traffic associated with an m- vertex sub-separator 


The sparse BLOCC scheme, described above, may be considered as a sequence of steps, each 
step corresponding to the elimination of vertices on the “-(-’’-separators of some level. Ini- 
tially, a single processor computes all the non-zero elements corresponding to a “+”-separator 
in the factor. As the computation proceeds, more than one processor work together to com- 
pute the elements corresponding to a “+”-separator. On any such step, first the non-zero 
elements in the columns corresponding to the horizontal sub-separators are computed and 
then those in the columns corresponding to the vertical part are computed. Here we analyze 
the data traffic associated with any one step, on which p processors combine together to 
compute the elements corresponding to a sub-separator. 

By Lemma 3, for any sub-separator T there are at most five non-zero blocks in the columns 
corresponding to the vertices on T. The number of non-zero blocks is five when T is enclosed 
within a rectangular box formed by the sub-separators with vertices that are ordered after 
those on T (see Figure 6), The following lemma gives a bound on the data traffic associated 
with computing the elements in the columns corresponding to such sub-separators. Not all 
sub-separators are enclosed by such rectangular boxes. In such cases there are less elements 
to be computed and consequently there is less data traffic. For the sake of simplicity of the 
analysis, it is assumed that no element of the factor needed in the computation of the five 
non-zero blocks is initially in the local memory of any of the p processors. Thus, the data 
traffic given below is a conservative estimate. 
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Lemma 5 Let T be any m -vertex sub-separator and p be the number of processors available 
for computing the elements of the factor in all the non-zero blocks within the columns lov)(T) 
through high(T). If T is an m-vertex horizontal sub -separator, then the associated data 
traffic is at most (53 + ll\/2 )m 2 • y/p. If it is a vertical sub-separator, then the data traffic 
is at most (28 + 8\/2)m 2 • y/p. 


Proof: Let T be an m-vcrtcx horizontal sub-separator that is enclosed completely within 
a rectangular box formed by the sub-separators whose vertices are eliminated after the 
vertices of T. Such a sub-separator has the worst case communication requirements among 
all the m-vertex sub-separators. 

First, consider the data traffic associated with computing the elements of the factor in the 
triangular diagonal block using p = r 2 / 2 + r/2 processors. Each of the sub-blocks requires 
nonzero elements from at most 2m fr rows out of the m rows in the range /ow>(r) through 
high(T) of the factor. No other information is needed. From the proof of Lemma 4, 
each of these rows has at most 11m nonzeros. Thus the communication requirement of 
each partition is at most 11m ■ 2m/y/2p and the total communication requirement of the 
triangular block is bounded above by ll\/2 m 2 • y/p . 

Now consider the data traffic associated with the off-diagonal blocks. Each partition is 
of size 6m/ y/p x m/y/p . Thus, each partition requires nonzero elements from 6m j y/p rows 
which are below the row high(T) in the factor. From the proof of Lemma 4, each of these 
rows has at most 7m nonzeros that are useful in completing the computations in any of the 
partitions. Each partition also requires information from m/y/p rows from the region low(T) 
through high(T). Each of these rows has at most 11m nonzeros. Thus, the communication 
requirement of each partition is at most 7m • 6m /y/p + 11m - m/y/p — 53m 2 / y/p and the 
total communication requirement of completing the computations at the off-diagonal blocks 
using p processors is less than or equal to 53m 2 y/p. 

Adding the communication costs corresponding to the diagonal and the off-diagonal blocks 
we get the total data traffic associated with T to be less than or equal to (53 + ll\/2)m 2 - y/p. 

A similar analysis can be used to compute the data traffic when T is an m-vertex vertical 
sub-separator and can be shown to be bounded above by (28 + 8\/2)m 2 ■ y/p. I 


The total data traffic of the sparse BLOCC scheme 


Applying the results from the above lemma, a bound is obtained on the total data traffic of 
the sparse BLOCC scheme. First some notation is introduced. Let Th(m,p, k ) represent the 
data traffic using p processors in completing the computations at all the nonzero elements 
li j e L in the columns corresponding to an m-vertex horizontal sub-separator that is 


21 


surrounded by higher ordered vertices on k sides. Let r v (m,p, k) represent the same for an 
m-vertex vertical sub-separator. From Lemma 5, r^m^p, 4) is at most (53 + lly/2)m 7 • yjp 
and r„(ra,p, 4) is at most (28 + 8\/2)m 2 • ^fp. Let r g {m! \p, k ) represent the total data traffic, 
using p processors, in completing the computations corresponding to all the sub-separators 
within an m ; - vertex sub-grid that is surrounded by higher ordered vertices on k sides. Note 
that the quantities r ^ and r v represent the data traffic corresponding to the vertices on a 
horizontal and a vertical sub-separator, respectively, whereas r g represents the data traffic 
corresponding to the vertices on an entire sub-grid. 

The following theorem gives an upper bound on the total data traffic in factoring the 
matrix A associated with an n-vertex 2-D regular grid graph using n a processors with the 
scheduling scheme as described above. 


Theorem 4 The total data traffic in factoring the n x n sparse matrix A, using n a 
processors , is O(n l + O l 7 ); i.e., r g {n,n a , 0) = 0(n 1+a / 2 ). 


Proof : On an n 1 ! 7 x n l l 2 regular grid there is an n 1 / 2 - vertex vertical sub-separator 
and two n 1 ! 7 / 2-vertex horizontal sub-separators (ignoring the additive constant -1). The 
vertical sub-separator is not surrounded by any vertices that are ordered after the vertices 
on the vertical sub-separator. Each of the two horizontal sub-separators are surrounded 
by such vertices only on one side. These three sub-separators subdivide the n-vertex grid 
graph into four sub-grids of size n 1 ! 7 j 2 x n I ^ 2 /2, each surrounded on two sides by higher 
ordered vertices. Thus, the total data traffic in factoring the corresponding matrix A is 
given by, 

Tg(n, n a ,0) = r v (n 1/2 ,n“, 0) + 2 r h (^n 1/2 , ^n Q , 1) + 4 r 0 (^n, ^n a ,2). 

A recursive expansion of the above expression contains data traffic terms for vertical sub- 
separators of different sizes that are surrounded on zero sides, two sides, three sides (in two 
different ways), and on all four sides by higher ordered vertices. It also contains data traffic 
expressions for horizontal sub-separators of different sizes surrounded in five different ways. 
To keep the analysis simple, it is assumed that all the four sub-grids of size n 1 / 2 / 2 x n 1 / 2 /2 
are surrounded on all four sides. This simplification results in a conservative expression for 
the data traffic, but affects only the constant terms in the bound. Thus, 


Now, 


r g (n,n a ,0) < r v (n 1/2 ,n", 0) + 2 u(^n 1/2 , ^n a ,l) + 4 T g (^n, ^n a ,4). 


r 9 (^7i,^n a ,4) = r„(^n , / 2 ,in“,4) + 2r, l (^n 1/2 ,in 0 , 4 ) + 4r fl (^»i,^n“, 4 ). 


From Lemma 5, it follows that, 
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An analysis similar to that given in Lemma 5 yields 

r„(n l / 2 ,n 0 , 0) < 8>/2 -n 1+a/2 


and 


Thus, we get 


Th( -rc 1 / 2 , -n a 
V 2 ’2 


r g {n,n Q , 0) 


X)^ 22±25 v/2 wl+a/2 
8 

< 78 + 71 y/2 . n \+a/2 
2 


I 


3.6 A lower bound on the data traffic complexity 


In the following theorem, the communication bound of the sparse BLOCC scheme is shown, 
by giving a lower bound proof, to be optimal in an order of magnitude sense. 


Theorem 5 Under the condition of uniform load distribution, the data traffic in factoring 
the n x n sparse matrix A, using n a processors , a < 1, is 0(n 1+a / 2 ). 


Proof: For a regular 2-D grid graph with n vertices, the separator size for nested dissection 
ordering is n 1 / 2 [12]. From Lemma 3, it follows that the factor L has an n l t 2 x n 1 ^ 2 dense 
triangular diagonal block incorporated in it. Ftom Theorem 2, the data traffic involved in 
completing the computations associated with the elements of this dense triangular block, 
under the condition of uniform load distribution using n Q processors, is fl(n 1+a / 2 ). Since 
the factorization of A cannot be completed without completing the factorization of this 
dense block, the result follows. I 


From Theorem 4 and Theorem 5, it is clear that the load assignment scheme described 
here for factoring the n x n sparse matrix using n Q processors is optimal in an order of 
magnitude sense. Note that when n a y a > 1 , processors are used, the data traffic bound 
given in Theorem 3 holds. 


4. Concluding remarks 


In this paper we have analyzed the data dependencies in the Cholesky factorization of 
dense and sparse symmetric, positive definite matrices. The model of computation assumes 
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a multiprocessor system with a memory hierarchy. Based on this analysis it is shown that 
under the condition of uniform load distribution the data, traffic associated with factoring 
an n x n dense matrix is fi(7i 2+a / 2 ) when n a , a < 2, processors are used. The same 
is shown to be n(n, 1+a / 2 ), a < 1, for factoring an n x n sparse matrix representing a 
2-dimensional regular grid graph where the vertices are ordered using the nested dissection 
ordering methods. Block based partitioning schemes are presented that asymptotically 
achieve these bounds on the data traffic. 

The sequential computation time for factoring the n x n sparse matrix A is 829n 3 / 2 /84 + 
O(n logn) [6]. As stated for the dense matrix case, the assumption here is that the compu- 
tation cost of each step of the innermost loop is one and costs involved in the other steps are 
ignored. Under the same assumption, it can be shown that the computation time for the 
sparse BLOCC scheme is at most 283n 3 / 2 ” a /4 if n a processors are used. In [7], a parallel 
scheme for factoring the matrix A on a multiprocessor system is given that is analogous 
to the wrap around assignment scheme described in the Section 2.5 for dense matrices. 
This scheme has the property of distributing the work evenly among the processors. The 
computation time to factor the sparse matrix A on n a processors with the wrap around 
scheme is at most 1977i 3 / 2_a /4. However, the data traffic associated with that scheme is 
less than or equal to 183n 1+a /4. Note that the difference in the computation time with 
the BLOCC scheme and with the wrap around assignment scheme is less than a factor of 
two. The BLOCC scheme is able to compute the factor efficiently in the case of the sparse 
matrices because the processors are now assigned blocks in a wrap around fashion which 
tends to distribute the load evenly. On the other hand, the data traffic associated with the 
BLOCC scheme is an order of magnitude less than that for the wrap around assignment 
scheme. Moreover, in the former scheme, as many as n processors may be used before the 
total data traffic reaches the maximum value of 0(n 3 / 2 ), whereas in the later scheme only 
up to n 1 / 2 processors may be used efficiently. The implications of the reduced data traffic 
on the performance are as follows. 

The sparse BLOCC scheme reduces the communication requirement to 0(n l + a l 2 ) by re- 
moving the constraint of column-level indivisibility. Here the indivisible work unit is the 
computation corresponding to a nonzero element in the factor. The reduction in the com- 
munication requirements is brought about by improving the utilization of the data accessed 
from shared memory by each processor. Consider the factorization of an m X m dense 
matrix. Let the data utilization of a data element accessed by a processor be defined as 
the number of computations in which that element is used by that processor divided by 
m. Since an element in the factor is needed in at most m computations, the maximum 
utilization of any data accessed is one. Let the aggregate data utilization for a processor be 
defined as the average utilization of the individual data elements accessed by that proces- 
sor. In the BLOCC scheme applied to an m x m dense matrix, each processor accesses 
at most 2m 2 / y/p elements from the shared memory and each element is used in at least 
m/ yfp computations. Thus, the utilization of each data accessed is at least l/y/p and so is 
the aggregate utilization of all the data accesses. On the other hand, with the column-level 
work assignment scheme, each processor accesses 0(m 2 ) elements from the shared memory. 
Of these, only 0(m/p ) elements have a utilization of one and the data utilization for the 
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remaining elements is 1/p which gives an aggregate data utilization of approximately 1/p. 
Similar improvements in data utilizations are obtained in factoring a sparse matrix. 

It should be noted that the square shape of the submatrix partitions produce the best 
possible aggregate utilizations. For the algorithm considered here, the data dependencies 
are such that rectangular and square partitions give rise to high data utilizations. Since 
the square partitions have the minimum perimeter for a given area, the number of data 
elements accessed (which is proportional to the perimeter of the partition) for a given 
work load (which is proportional to the area enclosed), is also a minimum for the square 
partitions. 

An effect of the improvement in the aggregate utilization of data and the resulting reduction 
in the communication requirements is the segregation of the accesses to the shared data. 
Since the total data traffic in factoring an m x m dense matrix using p processors is 
0(m 2 • yfjp), on an average each processor accesses only 0{m 2 / yfp) data. Note that the 
total shared data, is 0(m 2 ). Thus, on an average each element in the shared memory is 
accessed by 0( v /p) processors. The column-level assignment scheme, however, has a total 
data traffic of 0(m 2 • p) and thus, on an average each processor accesses 0(m 2 ) data or on 
an average each element in the shared memory is accessed by O(p) processors. An obvious 
implication of this observation is that for the scheme presented here, not only is the total 
data traffic reduced but also the requests at individual shared addresses. This can have 
considerable impact on the performance of the systems with a large number of processors. 

As a final remark, note that the data traffic analysis for the sparse BLOCC scheme exploits 
the fact that the underlying graph satisfies a v/n-separator theorem. Thus, similar schemes 
may be developed for any class of graphs satisfying an /(n)-separator theorem [13]. In 
such cases the data dependencies, the fill, and the computation time depend on /(n). In 
[12] the fill and the bounds on the sequential computation time for various values of f(n) 
are listed. Here we state the bounds on the corresponding data traffic when the systems 
are computed using n a processors. The data traffic of factoring a matrix corresponding to 
an n-vertex 3-dimensional regular grid using n a processors is 0(n 4 / 3+o / 2 ). For that case 
the computation time is 0(n 2 ~ a ). In general, the total data traffic for computing a factor 
of a matrix corresponding to a d-dimensional grid is 0(n 2<r+a / 2 ), where a = 1 — 1/d. The 
computation cost is 0{n Za ~ (t ). For an n-vertex finite element graph* with no element having 
more than k boundary vertices, the total data traffic in factoring the associated matrix is 
0(k 2 • n 1+a / 2 ). The computation time is 0(k 3 * n 3 / 2-a ). All these quantities are optimal in 
an order of magnitude sense. 


*A finite element graph is any graph formed from a planar embedding of a planar graph by adding all 
possible diagonals to each face; i.e., there is a clique corresponding to each face of the embedded planar 
graph. 
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